DOCUMENT RESUME

ED 194 616                                        TM 800 747

AUTHOR          Brossell, Gordon
TITLE           Validation of Topics and Comparisons of Three
                Presentation Modes for the Writing Subtest of the
                Florida Teacher Certification Examination. Volume
                Five of Five.
INSTITUTION     Florida State Univ., Tallahassee. Coll. of
                Education.
SPONS AGENCY    Florida State Dept. of Education, Tallahassee.
PUB DATE        Mar 80
NOTE            29p.; For related documents, see TM 800 743-46.

EDRS PRICE      MF01/PC02 Plus Postage.
DESCRIPTORS     Competency Based Teacher Education; Education Majors;
                Elementary Secondary Education; *Essay Tests; Higher
                Education; *Minimum Competency Testing; Student
                Evaluation; Teacher Certification; *Test Validity;
                Writing (Composition); *Writing Skills
IDENTIFIERS     *Essay Topics; Florida; Florida Teacher Competency
                Examination; Interrater Reliability; *Presentation
                Mode; Writing Evaluation



ABSTRACT

     The hypothesis that role-playing scenarios specifying full
rhetorical contexts were a superior means of eliciting valid writing
samples for the purpose of assessing compositional skills of
prospective Florida teachers was tested. In addition, topics for use
on the initial administrations of the Florida Teacher Competency
Examination's writing subtest were developed and validated. Six essay
topics selected by a panel of validators were cast in three different
presentation modes according to degree of specification of rhetorical
context, or "informational load (IL)." A low IL presented the topic
in a brief phrase, leaving the writer free to make decisions about
audience, purpose, form, and tone without guidance. A moderate IL
gave the writer an orientation to the task without including
specifications. A high IL gave a full rhetorical context. The six
topics in the three modes, 18 combinations in all, were administered
randomly to a sample of college education majors. Evidence gathered
in statistical and rhetorical analyses points clearly at the moderate
IL as the presentation mode most likely to stimulate the best writing
of a large number of examinees. (PL)



*   Reproductions supplied by EDRS are the best that can be made   *
*                    from the original document.                   *




VALIDATION OF TOPICS AND COMPARISON
OF THREE PRESENTATION MODES FOR THE WRITING SUBTEST
OF THE FLORIDA TEACHER CERTIFICATION EXAMINATION






Gordon Brossell

College of Education 
Florida State University 



VOLUME FIVE OF FIVE 



Under Contract to the Department of Teacher Education 
Florida State Department of Education

#080-064



March, 1980 







TABLE OF CONTENTS

Purpose of the Study
Design of the Study
     Overview
     Topic Selection
     Essay Field Trial
     The Rating Team
     Rater Training
Essay Ratings: The Data
Interrater Reliability
Analysis of the Data
A Rhetorical Perspective on the Essays
A Perspective on Modes and Topics
Recommendations
     Mode
     Topic
     Rhetorical Modifications
     Security Measures
Appendixes






Purpose of the Study

Despite a considerable body of professional opinion, there is
little empirical evidence that shows the effects of essay topics and
their mode of presentation on writers. It is a widely held notion
among composition teachers that specification of rhetorical
context--that is, the identification for a writer of the purpose
of a piece of writing, its intended audience, its subject, its
form, and its "voice"--will enable him or her to understand more
fully the demands of a particular writing task and so to produce a
more fully realized, coherent piece of work. Some researchers have
passed beyond speculation about the necessity for rhetorical
specification, preferring instead to ask questions about the impact
of certain elements within a full writing context rather than about
the effects of the context itself. Yet when it comes to demonstrating
these effects, to showing how changes in the mode of presentation
of an essay topic result in variations in the quality of an essay
written on it, there is but scant data available upon which to decide
how best to design a valid test of writing competence.

That is the question: how to devise an essay examination
whose topics and whose presentation modes will offer a fair test
of compositional skill to a population of thousands of examinees.
In an attempt to answer this question, and to cut a path through the
jungle of research reports, testimonials, reviews, commentaries,
and self-styled theories which comprise the relevant literature on
the subject, James Hoetker undertook Volume IV of the Writing
Subtest Handbooks, "On Writing Essay Topics for a Test of the
Composition Skills of Prospective Teachers." In it he put forth the
recommendation that topics for the Florida Teacher Certification
Examination should indeed present full rhetorical contexts to the
writers and that furthermore they should be based on situations that
prospective teachers would be apt to face in the real world. The
particular form he advocated was the hypothetical situation, or
scenario, that is predicated on a variety of roles typically played
by practicing teachers. Despite careful, extensive research,
logically sound reasoning, and a convincing argument, Hoetker's
recommendation had to be qualified by a lack of empirical data:

     In the absence of the needed experimental evidence,
     and in the presence of the possibility that partial
     specification of context might make a difference, the
     safest course seems to be not to take chances, and to
     produce topics . . . that give full specification of the
     class of discourse that has been demonstrated to have
     direct pertinence to a teacher's job.¹

It remained to test the hypothesis that role-playing scenarios 
specifying full rhetorical contexts were a superior means of eliciting 
valid writing samples for the purpose of assessing the compositional 
skills of prospective Florida teachers. The current study did just
that.

In addition, the study sought to develop and validate a
minimum of four topics for use on the initial administrations of the
writing subtest. The validation of topics for writing examinations
is neither a standardized nor a widely practiced procedure and has
been used consistently only by professional testing organizations
that administer writing exams regularly, such as the Educational
Testing Service and College Entrance Examination Board. Essentially,
validation is a process whereby a verbally expressed writing
stimulus--a topic--is certified by a number of experienced reviewers
to be free from the kinds of rhetorical, structural, and psychological
biases which might otherwise affect a writer. The process normally
includes, as it did in this study, a series of critical reviews,
emendations, and editions of topics, as well as a trial of them
under conditions similar to those that will exist in the actual
examination. It is--and this process was--aimed at attaining a
high level of agreement among expert consultants about the possible
impact of writing stimuli and is in keeping with the subjective but
rigorous and criteria-oriented procedures that mark good programmatic
writing assessment.

     1. James Hoetker, "On Writing Essay Topics for a Test of the
Composition Skills of Prospective Teachers," Florida State Department
of Education, 1979, p. 76.

Design of the Study 

Overview. Six essay topics were selected from a list generated by
a panel of validators specially chosen by the investigator. Each
topic was cast in three different presentation modes according to
degree of specification of rhetorical context, or "informational
load." The first mode presented the topic in a brief phrase only,
leaving the writer free to make decisions about audience, purpose,
form, and tone without guidance of any kind. This mode, Mode 1,
was said to contain low informational load. Mode 2, characterized
by moderate informational load, presented a general introductory
statement about the topic and then asked the writer to state his or
her own views on it, giving the writer an orientation to the task
but leaving specifications aside. Mode 3 described a hypothetical
situation placing the writer in the position of having to state
personal views on a subject as in Mode 2, but this time in a full
rhetorical context--that is, with an identified audience, a specific
form, and a stated or clearly implied purpose. The writer was thus
given a point of view, or rhetorical stance, from which to write,
though the substance of the piece--the actual views expressed--was
always left to his or her personal disposition. This mode was said
to be marked by high informational load.

The six topics in the three modes, eighteen combinations in all,
were then administered randomly in an essay examination trial--one
to each writer--to a sample population of undergraduate education
majors at two state universities, Florida State University in
Tallahassee and the University of South Florida in Tampa, under
test conditions closely resembling those anticipated for the first
administrations of the actual writing subtest. The resulting
papers were collected and screened by the investigator for degree
of commitment to the task (the writers were asked to take the test
seriously though they knew, of course, that it would have no real
consequences for them), and those few found to exhibit clearly
insufficient effort were removed. Twenty essays were retained in
each mode of each topic, or "cell," leaving a total of 360 in the
sample population.
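
The cell structure described above can be sketched in a few lines of Python. This is only an illustration: the names (build_cells, assign_packets) are hypothetical, and the report does not say how the random distribution of test packets was actually carried out, so the shuffling scheme here is an assumption.

```python
import itertools
import random

TOPICS = range(1, 7)    # six validated topics
MODES = range(1, 4)     # Mode 1 (low), Mode 2 (moderate), Mode 3 (high informational load)
ESSAYS_PER_CELL = 20    # essays retained in each topic-mode cell


def build_cells():
    """Return every topic-mode combination, i.e. every cell of the design."""
    return list(itertools.product(TOPICS, MODES))


def assign_packets(n_writers, seed=None):
    """Hypothetical helper: give each writer one cell, dealing out shuffled cells."""
    rng = random.Random(seed)
    cells = build_cells()
    packets = []
    while len(packets) < n_writers:
        rng.shuffle(cells)          # new random order for each pass through the cells
        packets.extend(cells)
    return packets[:n_writers]


cells = build_cells()
total_essays = len(cells) * ESSAYS_PER_CELL
```

Dealing out whole shuffled passes keeps the cells close to balanced, which matters when a fixed number of essays per cell is wanted.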

The essays were read and rated by a panel of three raters who
had undergone training in holistic evaluation of writing samples
conducted by the investigator using the Training Manual (Volume II
of the Writing Subtest Handbooks) developed for this purpose. Essays
receiving discrepant scores were read and rated by a referee whose
ratings replaced the discrepant ones according to the procedures
established in Volume III of the Handbooks. The resulting final
scores of the essays were then entered, together with their
estimated lengths, into the FSU computer for statistical analysis.

Topic Selection. The investigator recruited three experienced
teachers of written composition to serve as validators: Barbara Ash, 
a second-year doctoral candidate in English Education at FSU and a 
former high school English teacher, who also served as the study's 
chief administrative assistant; Linda Clarke, an English teacher at 
Lincoln High School in Tallahassee who holds a master's degree in 
English Education from FSU; and Pamela Laws, a composition instructor 
at Tallahassee Community College who also holds an FSU master's 
degree in English Education. 

The panel met several times in the fall of 1979, completing its
work in mid-October. Initially, the validators were asked to
generate a working list of possible topics using a set of criteria
adapted in part from Volume IV of the Handbooks. Topics meeting
these criteria were thought by the validators to be:

     1) self-explanatory (i.e., clearly and explicitly phrased);
     2) defined and limited;
     3) familiar to every examinee;
     4) stimulating;
     5) fresh;
     6) of middle emotional ground (i.e., neither too pedestrian
        nor too sensational);
     7) nonbiased and nonbiasing.

From an initial list of more than twenty possible topics, the panel,
after deliberating the potential effects of each, selected eight it
felt met the criteria in each of the three presentation modes.
These eight topics in each mode were then sent for review and comment
to the consultants to the Writing Subtest Handbooks, Dr. Nancy
McGee of the Department of Secondary Education at the University of
Central Florida in Orlando and Professor Dan Kelly of the English
Department at the University of Florida in Gainesville. Their
responses to the topics together with further deliberation by the
panel resulted in the elimination of two of the eight topics,
leaving six judged valid for the essay trial.

Essay Field Trial. During November, 1979, the topics were administered
to a sample population of undergraduate education majors in an essay
field trial. Students in professional education classes at two
universities, FSU and USF, took the tests, which were administered
by the investigator and his assistant at FSU and by Dr. Annie Ward,
technical consultant to this study, at USF. Test conditions were
similar to those anticipated for the first administrations of the
actual examination: directions were printed on the cover sheet
of each test packet, which contained
blank lined paper for the examinees' use; and a period of 45 minutes
was allowed in which to complete the exam. No choice of topic or
mode was offered, to insure that the proper number of essays in
each cell could be obtained. The test administrators announced the
purpose of the field trial at the beginning of each class period in
which the testing was conducted. The examinees thus had no
foreknowledge of the test, a condition which helped to insure their
attendance in numbers adequate to collect sufficient essay samples.
All three test administrators reported that this condition had no
apparent effect on the attitudes of the writers toward the test, and
the essays themselves made no mention of it.

A total of 360 essays comprised the final sample. Of these,
190 were written by students at FSU and 170 by students at USF.
There were 294 females in the sample and 63 males; three additional
writers failed to identify themselves. The students represented a
large variety of majors, about 30 in all, though the largest numbers
of them came from programs in elementary education (including early
childhood education and child development) and physical education--159
in the former case and 55 in the latter. The sample was thus
roughly proportional to the distribution of academic majors currently
being prepared in Florida teacher education programs. A breakdown
of the sample population by academic major follows.



FSU

Major                        # in Sample

Physical Ed.                      55
Special Ed.                       21
Music Ed.                         17
Elementary Ed.                    17
English Ed.                       13
Speech Pathology                  10
Social Studies Ed.                 7
Home Economics Ed.                 6
Art Ed.                            5
Mathematics Ed.                    5
Social Work                        5
Early Childhood Ed.                4
Child Development                  4
Science Ed.                        3
Visual Disabilities                3
Library Science                    3
Vocational/Business Ed.            2
Industrial Arts Ed.                2
Foreign Language Ed.               2
Theater Ed.                        1
Political Science                  1
ESL                                1
Psychology                         1
Art Therapy                        1
Career Ed.                         1

Total                            190


USF

Major                        # in Sample

Elementary Ed.                    98
Early Childhood Ed.               36
Learning Disabilities             16
EMR                                8
EMH                                6
English Ed.                        2
Gifted Ed.                         2
Foreign Language Ed.               1
Deaf Ed.                           1

Total                            170



The Rating Team. Two of the validators, Linda Clarke and Pamela
Laws, also served as raters of the essays; a third rater, Carol Gray,
an experienced teacher of composition and a member of the English
department at Leon High School in Tallahassee, joined the rating
team in December. Dr. Dwight Burton, Professor of English Education
and chairman of the Department of Curriculum & Instruction in the
College of Education at FSU, agreed to serve as referee, whose task
is to read and rate essays receiving discrepant ratings. (See
Volume III of the Handbooks, pp. 27-36, for a full treatment of this
procedure.) These people comprised a first-rate holistic scoring
team, meeting the requirements of professional experience and
technical knowledge imposed by the study and, in the investigator's
opinion, surpassing the degree of competence that might be expected
of a typical team rating essays written in the actual subtest.

Rater Training. In early December, the rating team (except for
Dr. Burton, who had served in a similar capacity before and who was
thoroughly familiar with his referee's role) underwent initial
training in holistic scoring of essays. The training session,
which occurred on a Saturday on the FSU campus, was conducted by
the investigator, aided by Barbara Ash, the administrative assistant,
using Volume II of the Handbooks, the Training Manual. The raters
spent roughly the first half of the session on the materials and
procedures called for in the manual--practicing holistic rating
with the appropriate criteria and rating guides, and attempting to
reach a high level of consistency in their rating of the same essays.

When the formal training was completed, a check was made to
determine the level of interrater reliability, or consistency,
achieved in the training session. Thirty of the essays written on
the field trial were read and rated independently by each rater in
three packets of ten. Their ratings were then analyzed in terms of
the four indexes described in Volume I of the Handbooks--percentage
of complete agreement among raters, average percentage of two of
three raters agreeing, average percentage of agreement by pairs of
raters as to whether an essay passes or fails, and percentage of
complete agreement about whether an essay passes or fails. The
following table shows how the reliability levels achieved by the
rating team compared with the target levels established in Volume I.

                                      Raters' Level    Target Level

Index 1--% Complete Agreement              40             30-40

Index 2--Average % Two of Three            96.7           80-90
         Raters Agreeing

Index 3--Average % Agreement by            82.2           80-90
         Pairs as to Pass/Fail

Index 4--% Complete Agreement              73.3           70-80
         about Pass/Fail

In indexes 1, 3, and 4, the raters' reliability levels fell within 
the desired ranges; in index 2, their level exceeded that of the 
target range. (Only one of the thirty essays read in the reliability 
check received a discrepant set of ratings and needed subsequently to 
be read by the referee.) The investigator thus had convincing 
evidence that the training session had been successful and that the 
rating team had achieved a level of reliability sufficient to sustain 
a high degree of confidence in their ratings of field trial essays. 
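
The four indexes can be illustrated with a short Python sketch. Their exact definitions are given in Volume I of the Handbooks, not here, so the readings below are assumptions: Index 2 is taken to mean the share of essays on which at least two of the three raters gave the same rating, and a rater is taken to "pass" an essay when his or her rating reaches a cutoff (pass_mark, a hypothetical parameter). The function name is likewise illustrative.

```python
from itertools import combinations


def reliability_indexes(ratings, pass_mark):
    """ratings: list of (r1, r2, r3) tuples, one tuple per essay.
    pass_mark: lowest single rating counted as passing (assumed cutoff)."""
    n = len(ratings)
    # Index 1: all three raters give the identical rating
    complete = sum(len(set(r)) == 1 for r in ratings)
    # Index 2 (assumed reading): at least two of the three ratings coincide
    two_of_three = sum(len(set(r)) <= 2 for r in ratings)
    # Pass/fail verdict of each rater on each essay
    verdicts = [tuple(x >= pass_mark for x in r) for r in ratings]
    # Index 3: pass/fail agreement rate for each pair of raters, then averaged
    pair_rates = [
        sum(v[i] == v[j] for v in verdicts) / n
        for i, j in combinations(range(3), 2)
    ]
    # Index 4: all three raters reach the same pass/fail verdict
    complete_pf = sum(len(set(v)) == 1 for v in verdicts)
    return {
        "index1": 100 * complete / n,
        "index2": 100 * two_of_three / n,
        "index3": 100 * sum(pair_rates) / len(pair_rates),
        "index4": 100 * complete_pf / n,
    }


# Made-up ratings for four essays, for illustration only
example = reliability_indexes([(2, 2, 2), (3, 3, 2), (1, 2, 3), (4, 3, 3)], pass_mark=3)
```

Under these readings, the one discrepant essay among the thirty in the reliability check corresponds to an Index 2 value of 29/30, or 96.7 percent.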

Essay Ratings: The Data

After initial training, the raters were given the task of
rating essays written on the field trial. All the essays were
read and rated independently by each rater under conditions (the
work was done for the most part in the raters' homes) as similar
as is possible to obtain when raters are not gathered in one place,
as they were for the initial training session. When the ratings
were submitted, they were reviewed and those essays receiving
discrepant ratings--72 in all, or 20% of the total number--were
given to the referee. His ratings were substituted for the discrepant
ones, and the scores of all 360 essays were finalized. The following
table summarizes the results of the essay field trial.



Score        N        % of N

  3          37        10.3
  4*          0         0
  5         100        27.8
  6          71        19.7
  7          76        21.1
  8          47        13.1
  9          13         3.6
 10          13         3.6
 11           2          .6
 12           1          .3

Total       360       100

Mean Score--6.19
Median Score--6
Modal Score--5



*Initially there were 23 scores of 4. Of these, 10 became 3's
and 13 became 5's--the result of the substitution of the referee's
ratings, in each case, for a 1 or a 2 in the ratings distribution
comprising a score of 4 (1 1 2).
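
The median and modal scores, and the percentage column, follow directly from the frequency counts in the table. The sketch below, with hypothetical helper names, expands the counts into a list of 360 scores and uses Python's standard statistics module; it is offered as a check on the arithmetic, not as the procedure the investigator used.

```python
from statistics import median, multimode

# Frequency of each final score, taken from the field trial table
SCORE_COUNTS = {3: 37, 4: 0, 5: 100, 6: 71, 7: 76, 8: 47, 9: 13, 10: 13, 11: 2, 12: 1}


def expand(counts):
    """Turn a {score: frequency} table into a sorted list of individual scores."""
    return [score for score, n in sorted(counts.items()) for _ in range(n)]


def percentages(counts):
    """Percentage of the sample at each score, rounded to one decimal place."""
    total = sum(counts.values())
    return {score: round(100 * n / total, 1) for score, n in counts.items()}


scores = expand(SCORE_COUNTS)
median_score = median(scores)
modal_scores = multimode(scores)
```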






In early January, the investigator, having secured the assistance
of the Office of User Services at the FSU Computing Center, entered
the ratings of the essays, together with their estimated lengths,¹
into the computer, the resulting data file becoming the basis of
the statistical analysis that followed.

Interrater Reliability

The level of reliability achieved by the rating team in rating the
field trial essays was measured using the four indexes described
above under Rater Training. The resulting figures reflect the
referee's ratings, unlike those of the training session, where a
referee was not involved; thus they represent one final measure of
the team's reliability as essay raters. The following table shows
how the team's rating effectiveness compared with the target ranges
in the four indexes of consistency. The figures in parentheses are
those the team attained in the initial training session and are
supplied for the sake of comparison with the levels of reliability
it achieved in the entire essay trial.



     1. Lengths were estimated according to the following procedure:
every tenth line of each essay was given a word count; the sum of
these counts was divided by the number of lines counted to get an
average number of words per line, which was then multiplied by the
number of lines in the whole essay. The resulting product was the
estimated length of the essay. To insure word counts accurate
enough to be meaningful, a check was run against the actual word
count in 60 essays. In one set of 30 essays, estimated and actual
word counts differed by 8.5% on the average, with twelve cases in
which differences exceeded 10%, a tolerable margin of error in this
kind of estimate. The average number of words per essay differed
by only ten from estimated to actual count, however. In another set
of 30 essays, the average difference between the estimated and actual
word counts was 7.35%, with only six cases in which the 10% margin
of error was exceeded, and the average word count per paper differed
by only ten words from estimated count. This check indicated
that the estimated word count was accurate enough for inclusion in
the statistical analysis. (Dr. Tom Denmark, Professor of Mathematics
Education at FSU, rendered advisory assistance on the word-count
procedure.)
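
The estimation procedure in the note above translates readily into code. The sketch below is a minimal illustration with hypothetical function names. It assumes the sampled lines are lines 10, 20, 30, and so on (the report says only "every tenth line") and, as a further assumption, falls back to an exact count for essays shorter than ten lines.

```python
def estimate_length(lines, step=10):
    """Estimate an essay's word count from every `step`-th line.

    lines: the essay as a list of text lines.
    """
    sampled = lines[step - 1::step]      # lines 10, 20, 30, ... (assumed starting point)
    if not sampled:
        # Assumption: essays too short to sample are simply counted in full
        return sum(len(line.split()) for line in lines)
    words_per_line = sum(len(line.split()) for line in sampled) / len(sampled)
    return words_per_line * len(lines)


def percent_error(estimated, actual):
    """Absolute difference between estimate and actual count, as a percentage."""
    return 100 * abs(estimated - actual) / actual
```

For example, an essay of 30 lines whose sampled lines average three words each would be estimated at 90 words; against an actual count of 100 words, that is an error of 10%, the margin the report treats as tolerable.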






                                      Raters' Level    Target Level

Index 1--% Complete Agreement          32.2 (40)          30-40

Index 2--Average % Two of Three        98.3 (96.7)        80-90
         Raters Agreeing

Index 3--Average % Agreement by        81.3 (82.2)        80-90
         Pairs as to Pass/Fail

Index 4--% Complete Agreement          71.7 (73.3)        70-80
         about Pass/Fail



On three of the four indexes the rating team's level of consistency
fell within the target ranges; in one case, Index 2, the team's
level exceeded not only the target range but also the level it had
achieved in the training session. In indexes 1, 3, and 4, the small
dropoff from the training session levels to the field trial levels
is in all likelihood a result of the more than tenfold increase in
the number of essays read, and was hardly unexpected.

In addition to the four indexes, a coefficient of interrater
reliability was obtained for pairs of raters and for the rating
team both before and after the substitution of the referee's ratings.
Known as the Alpha coefficient, it is in simplest terms a statistical
indication of the expected correlation between the ratings of the
team on this task and those of a hypothetical team of similarly
comprised and similarly trained raters doing the same task. The
following table shows the Alpha coefficients for the rating team.

                       Without Referee's Ratings   With Referee's Ratings

Raters 1 & 2                     .619                      .640

Raters 1 & 3                     .720                      .799

Raters 2 & 3                     .686                      .815

Raters 1, 2, and 3               .759                      .828






The figures reflect the effect of the referee's ratings on the
team's between-rater consistency, increasing the level of reliability
in every instance and increasing it substantially in some. The
most important coefficient--that of raters 1, 2, and 3 (i.e., the
whole team) with the referee's ratings--is, as would be expected,
the highest, since the reliability of a group of trained raters
generally increases as its number increases and since the
substitution of a referee's ratings is, in and of itself, a deliberate
upward adjustment in interrater reliability. The level of reliability
achieved by the rating team is, in the judgment of the investigator,
sufficiently high to justify firm reliance on the data yielded in
the essay field trial.
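
The report does not give the formula behind the Alpha coefficient. Assuming it is the standard Cronbach's alpha with each rater treated as an "item," it can be computed as in the sketch below; the function name and the sample ratings are illustrative only.

```python
from statistics import pvariance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a list of per-essay rating tuples.

    ratings: one tuple per essay, one rating per rater, e.g. [(2, 3, 2), ...].
    Assumes the essays' total scores are not all identical.
    """
    k = len(ratings[0])                                   # number of raters
    # Variance of each rater's ratings across all essays
    rater_vars = [pvariance([row[j] for row in ratings]) for j in range(k)]
    # Variance of the summed score each essay receives
    total_var = pvariance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - sum(rater_vars) / total_var)


# Made-up ratings from two raters on four essays
alpha_pair = cronbach_alpha([(1, 1), (2, 3), (3, 2), (4, 4)])
# Three raters in perfect agreement give the maximum value of 1
alpha_perfect = cronbach_alpha([(1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4)])
```

Because the ratio of variances is the same whether population or sample variances are used, the choice of pvariance here does not affect the result.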

Analysis of the Data 

The purpose of the statistical analysis was to determine the
extent to which the final scores of essays written on the field
trial depended on three factors--topic, mode, and length. Toward
this end, two statistical operations were undertaken: analysis of
variance (ANOVA) and, later, a multiple regression analysis
(including a scattergram) of the effect of length on score.

Three separate analyses of variance were run on the SPSS
program at the FSU Computing Center. The first ANOVA tested for
the effects and interactions of topic and mode only, ignoring
length. The second ANOVA processed length with the main effects of
topic and mode, in effect treating all three factors equally. In
the third ANOVA, length was treated as a covariate, and the effects
of topic and mode were corrected for the effects of length.

The following table summarizes the statistical data for the
first ANOVA.

Source of Variation     Sum of Squares   DF   Mean Square      F     Significance of F

Main Effects                18.183        7      2.598        .853         .544
  Topic                      6.167        5      1.233        .405         .845
  Mode                      12.017        2      6.008       1.974         .140

Two-Way Interactions        36.817       10      3.682       1.210         .283
  Topic x Mode              36.817       10      3.682       1.210         .283

Explained                   55.000       17      3.235       1.063         .389



There were no statistically significant effects or interactions of topic
and mode on score, but the effect of mode was clearly much stronger
than the effect of topic.

In the second ANOVA, length was entered into the equation along
with topic and mode; its statistical summary follows.

Source of Variation     Sum of Squares   DF   Mean Square      F     Significance of F

Main Effects               295.983        8     36.998      16.042         .001
  Topic                     11.817        5      2.363       1.025         .403
  Mode                       3.235        2      1.617        .701         .497
  Length                   277.800        1    277.800     120.454         .001

Two-Way Interactions        13.475       10      1.347        .584         .827
  Topic x Mode              13.475       10      1.347        .584         .827

Explained                  309.458       18     17.192       7.454         .001



Once again the main effects and the interaction of topic and mode were
insignificant, but the effect of length on score was significant
at the .001 level.

In the third ANOVA, length was held constant in assessing the
effects of topic and mode. Under these conditions--with length
treated as a covariate--the ANOVA produced the following data.






Source of Variation     Sum of Squares   DF   Mean Square      F     Significance of F

Main Effects                18.183        7      2.598       1.126         .346
  Topic                      6.167        5      1.233        .535         .750
  Mode                      12.017        2      6.008       2.605         .075

Covariate
  Length                   277.800        1    277.800     120.454         .001

Two-Way Interactions        13.475       10      1.347        .584         .827
  Topic x Mode              13.475       10      1.347        .584         .827

Explained                  309.458       18     17.192       7.454         .001

This time length was again significant at the .001 level, and the
effects of topic and mode were again insignificant; but the effect
of mode approached significance at the .05 level. That is, when
the effects of topic and mode were adjusted for the effects of
length, mode was significant at the .075 level.
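
The entries in each ANOVA table are tied together by simple arithmetic: a mean square is a sum of squares divided by its degrees of freedom, and an F ratio is an effect's mean square divided by the residual mean square. The residual mean square is not printed in the tables; the sketch below recovers it from the Length row of the third ANOVA (mean square divided by F), which is an assumption about how the printed figures were produced. Function names are illustrative.

```python
def mean_square(sum_of_squares, df):
    """Mean square = sum of squares / degrees of freedom."""
    return sum_of_squares / df


def f_ratio(effect_ms, residual_ms):
    """F = effect mean square / residual mean square."""
    return effect_ms / residual_ms


# Figures from the third ANOVA (length treated as a covariate)
length_ms = mean_square(277.800, 1)
# Assumption: residual mean square backed out of the printed Length row
residual_ms = length_ms / 120.454

mode_f = f_ratio(mean_square(12.017, 2), residual_ms)
topic_f = f_ratio(mean_square(6.167, 5), residual_ms)
```

Recomputed this way, the F ratios for mode and topic agree with the printed values of 2.605 and .535 to within rounding.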

To help identify the nature of the significant correlation
between length and score, a multiple regression analysis, together
with a scattergram, was run. This analysis, the statistical summary
of which follows below, revealed a correlation of moderate statistical
significance between them, and the scattergram (not reproduced
here) showed the correlation to be curvilinear in nature and of
the following order: up to and including a score of 9, the mean
length of the essays gradually increased; after 9, mean length
varied widely (with the number of writers scoring higher than 9
greatly diminishing), and the correlation broke down. A table
showing mean lengths (including standard deviations and variance
coefficients) by score follows the multiple regression analysis
summary below.






Multiple Regression Analysis Summary

F to Enter   Signifi-   Multiple R   R Square   R Square   Simple R   Overall F   Signifi-
or Remove    cance                              Change                            cance

122.97136     .000        .50564      .25567     .25567     .50564    122.97136    .000



Mean Lengths by Score

Score    Mean Length    Standard Deviation    Variance Coefficient      N
           (Words)           (Words)

  3          208               112                13049.5088            37
  4                                                                      0
  5          260                89                 7905.0961           100
  6          315                88                 7688.5968            71
  7          342                83                 6864.4370            76
  8          394               138                18972.3784            47
  9          418                92                 8436.0897            13
 10          374               102                10409.3333            13
 11          405                 9                   84.5000             2
 12          493                 0                    0                  1

Total        312               114                13049.5088           360


The correlation between length and scores 3 through 9 suggests
that essays of certain lengths were more likely to get certain scores.
Indeed, in the table above, the mean length of essays varies directly
with increasing scores up through 9--scores which account for more
than 95% of the essays in the sample. It can be said then that the
longer one's paper was--or, more specifically, the greater its length
in the range of roughly 200-400 words--the more apt it was to get a
higher score, at least up to a score of 9.
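
The figures in the regression summary are related in a way that can be checked by hand or with a few lines of Python. With a single predictor (length), R Square is the square of the correlation coefficient, and the overall F follows from R Square and the sample size. The sketch below assumes one predictor and all 360 essays; the function names are illustrative.

```python
def r_squared(r):
    """Proportion of variance in score accounted for by the predictor."""
    return r * r


def overall_f(r2, n, predictors=1):
    """Overall F for a regression with the given R Square and sample size."""
    return (r2 / predictors) / ((1 - r2) / (n - predictors - 1))


r2 = r_squared(0.50564)       # Multiple R from the regression summary
f = overall_f(r2, n=360)      # assumes all 360 essays entered the analysis
```

The result reproduces the printed R Square of .25567 and an overall F of about 122.97; put differently, length accounts for roughly a quarter of the variance in scores.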

The level of significance of mode, particularly when adjusted for
the effects of length, suggests that mode might too have affected the
scores of essays, though that conclusion appears to be considerably
more tenuous than that adducing length as a decisive factor. Mode,
after all, was not a statistically significant factor on essay scores,
unless one wishes to argue that the .075 level attained in the third
analysis of variance is, in this study, significant. There is no
basis for doing so, however. What does appear to be a sensible
inference is, rather, that mode is correlated with score by way of
length; that is, a certain mode produced a tendency toward higher
scores by virtue of having stimulated examinees to write essays of
greater length. A look at mean scores and mean lengths by modes in
the following table reveals the beginnings of a case for this line
of reasoning.

            N     Mean Score     SD     Mean Length     SD

Mode 1     120       6.14       1.74        320         123

Mode 2     120       6.43       1.77        326         101

Mode 3     120       5.98       1.70        288         115

Total      360       6.19       1.74        312         114

Mode 2 essays scored higher on the average and were of greater 
average length than essays written in the other two modes. In 
addition, their standard deviation for length was a bit lower than 
those for the other two modes and for the entire sample, and their 
standard deviation for score was virtually the same as those of the 
other modes and of the entire sample. To be sure, the differences are 
not dramatic, but one hardly expects dramatic differences in a 
study whose sources of variation are themselves notably subtle. 
The magnitude of the variation among the three modes is relatively 
small to begin with, especially given the careful attention to 
selection of topics, the controlled conditions of the essay trial, 
and the relatively large size of the sample population. So a 
difference in mean score of nearly a half point, for example--the 
difference between the mean scores of Mode 2 and Mode 3 essays--is 
not a negligible one, particularly if it can reasonably be attributed 
to specific, deliberate changes in the wording of topics. And while 
the differences in the mean lengths of the modes are quite small, 
the fact remains that Mode 2 writers exhibited a tendency to write 
slightly longer essays and, in so doing, to achieve somewhat higher 
scores. 
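
The comparisons drawn in this paragraph can be checked directly from the mode-level means reported in the table. The short Python sketch below is offered only as an illustration of that arithmetic; the dictionary layout and the function name gap are assumptions introduced here and are not part of the study.

```python
# Mode-level means copied from the table of scores and lengths by mode.
# The data structure and names are illustrative, not the study's own.
MODES = {
    "Mode 1": {"mean_score": 6.14, "mean_length": 320},
    "Mode 2": {"mean_score": 6.43, "mean_length": 326},
    "Mode 3": {"mean_score": 5.98, "mean_length": 288},
}


def gap(a: str, b: str, key: str) -> float:
    """Difference in one summary figure between two modes, rounded to 2 places."""
    return round(MODES[a][key] - MODES[b][key], 2)


# Compare Mode 2 with each of the other two modes.
for other in ("Mode 1", "Mode 3"):
    score_gap = gap("Mode 2", other, "mean_score")
    length_gap = gap("Mode 2", other, "mean_length")
    print(f"Mode 2 vs {other}: {score_gap} rating points, {length_gap} words")
```

The Mode 2 to Mode 3 gap in mean score is the "nearly a half point" cited above; the gap between Modes 2 and 1 is smaller on both measures.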

A similar comparison of mean scores and mean lengths by topics 
and by topics within modes lends greater weight to the argument that 
mode is correlated to score by way of length. The table below and 
that which follows present these data. 

            N    Mean Score    SD    Mean Length    SD

Topic 1     60      6.35      1.96       310        118
Topic 2     60      6.25      1.59       335        127
Topic 3     60      6.22      1.54       293        133
Topic 4     60      6.28      1.68       299        104
Topic 5     60      5.97      2.08       311         97
Topic 6     60      6.05      1.58       321        102

Total      360      6.19      1.74       312        114



The figures here are inconclusive. Mean scores for all topics 
except #5 cluster within a range of .3 of 
a rating point; mean lengths do not vary proportionately with 
increasing mean scores. There is no discernible pattern in the 
standard deviations of either measure. A view across topics, in 
other words, seems to bear out the insignificant effect of topic 
on score revealed in the three analyses of variance. The view across 
topics within modes, however, is more instructive. 



                N    Mean Score    SD    Mean Length    SD

Mode 1
  Topic 1      20       6.45      ?.37       332        152
  Topic 2      20       6.30      ?.42       347        119
  Topic 3      20       6.30      ?.49       327        151
  Topic 4      20       6.20      ?.82       309        117
  Topic 5      20       5.95      ?.73       301         95
  Topic 6      20       5.65      ?.53       303         99

Mode 2
  Topic 1      20       6.30      ?.59       289         95
  Topic 2      20       6.20      ?.77       335        183
  Topic 3      20       6.80      ?.64       316        117
  Topic 4      20       6.15      ?.59       301        111
  Topic 5      20       6.10      ?.38       337         80
  Topic 6      20       7.05      ?.61       381         82

Mode 3
  Topic 1      20       6.30      ?.92       308        101
  Topic 2      20       6.25      ?.65       324        158
  Topic 3      20       5.55      ?.28       236        115
  Topic 4      20       6.50      ?.76       289         87
  Topic 5      20       5.85      ?.18       295        111
  Topic 6      20       5.45      ?.19       278         99

Total         360       6.19      1.74       312        114




Here the figures for Topics 6 and 3 in Mode 2 stand out: the mean 
scores are the highest in the entire breakdown, as is the mean length 
for Topic 6. Clearly the presentation of these two topics in this mode 
had an influence on writers such that they wrote essays of greater mean 
length and thus of higher mean score--to an extent that these are 
the only two categories of essays in the statistical breakdown that 
can unqualifiedly be referred to as above average in quality. 
Topic 6 essays in Mode 2 also had a low standard deviation for mean 
length, suggesting that the effects of this version of this topic 
were more consistent from writer to writer than were, with one 
exception, all other combinations. The tendency exhibited by 
Topics 6 and 3 in Mode 2 is particularly interesting when a comparison 
is made of their mean scores and mean lengths with those of the other 
two modes of those two topics. In Modes 1 and 3 of Topic 6, mean 
scores were only 5.65 and 5.45, and mean lengths were 303 and 278, 
respectively; and in Modes 1 and 3 of Topic 3, mean scores were 
6.30 and 5.55, while mean lengths were 327 and 236, respectively. 
The differences, while once again not dramatic, are systematic and 
even sizable given only the change in presentation mode to account 
for them. 

It must be remembered that, given the degrees of discrimination 
on the rating scale and the rendering of at least three and sometimes 
four ratings for each essay, the difference between scores of 5 and 6 
or 6 and 7 is not, qualitatively speaking, a minuscule one. An essay 
scoring 5 is only a marginally passing essay and had to have been 
considered incompetent by at least one rater. An essay scoring 
6 is, by definition, an essay of average competence and most likely 
was judged average by all three raters. An essay scoring 7 on the 
other hand is an essay of better than average competence and had to 
have been judged that way by at least one rater. 

The same can be said of the difference between essays scoring 
below and at the cutoff point on the scale, 5. An essay scoring 
3 is an incompetent essay and had to have been judged that way by 
all three raters or by two raters and the referee. An essay scoring 
4 is discrepant by definition because it was rated average by one of 
three raters and incompetent by the other two. The referee's rating 
in such a case determines in all likelihood whether the final score 
will be 3 or 5 (since there is little chance the referee will rate 
such an essay above 2). Either way there is a high level of consensus 
about the issue of competence (or incompetence) in any individual 
essay with an initial score of 4. An essay scoring 5 on the other 
hand, though only marginally competent, still earned its score by 
virtue of convincing two raters that it is of average competence 
(except in the highly unlikely event that the referee confirms a 
rating of 3 in a ratings distribution of 1 1 3). The point to 
remember, then, is that adjacent scores on the rating scale are the 
result of separate ratings and so are qualitatively different. 
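
The relation between individual ratings and final scores described in the two preceding paragraphs can be made concrete by enumeration. The Python sketch below assumes, as the discussion implies but does not state outright, that each rater assigns 1 (incompetent), 2 (average), or 3 (better than average) and that an essay's score is the sum of three ratings. The function name rating_patterns is hypothetical, and the referee's additional rating for discrepant essays is not modeled.

```python
from itertools import combinations_with_replacement

# Assumed rating scale, inferred from the text: 1 = incompetent,
# 2 = average, 3 = better than average. Score = sum of three ratings.
RATINGS = (1, 2, 3)


def rating_patterns(score: int) -> list[tuple[int, ...]]:
    """All unordered sets of three ratings that add up to the given score."""
    return [
        triple
        for triple in combinations_with_replacement(RATINGS, 3)
        if sum(triple) == score
    ]


# List every way each possible score (3 through 9) can arise.
for score in range(3, 10):
    print(score, rating_patterns(score))
```

Under these assumptions every pattern that yields a 5 contains at least one rating of 1, every pattern that yields a 7 contains at least one rating of 3, and a 4 can arise only from the distribution 1 1 2, which is consistent with the characterizations given above.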

With respect to the influence of modes, topics, and combinations 
thereof on writers, it is useful to look at the numbers of essays 
scoring at and below the cutoff point. The following table presents 
this information. 

               Mode 1       Mode 2       Mode 3      All Modes
              3's  5's     3's  5's     3's  5's     3's  5's

Topic 1        5    1       1    6       2    6       8   13
Topic 2        1    4       2    6       2    5       5   15
Topic 3        1    5       1    4       2    9       4   18
Topic 4        3    3       1    7       2    3       6   13
Topic 5        2    8       3    7       4    6       9   21
Topic 6        2   10       1    2       2    8       5   20

Total Exam    14   31       9   32      14   37      37  100






Again, Mode 2 and Topics 6 and 3 in Mode 2 stand out as the categories 
in which the fewest writers, comparatively speaking, were judged 
incompetent or marginally competent. (Topics 2 and 3 in Mode 1 are 
notable in this regard as well, as are Topics 1 and 4 in Mode 2.) 
To the extent that these figures represent a reasonable approximation 
of the percentages of examinees who will attain similar scores on 
the actual subtest, they are important. Essays scoring 3 and 5 
comprise fully 38% (137 of 360) of the sample population. If one 
assumes a proportionate distribution of such scores among modes, 
topics, and cells, one would expect one-third in each mode, one-sixth 
in each topic, and one-eighteenth in each cell. A glance at the 
table reveals a disproportionately low number of 3's in Mode 2 scores 
(as well as a similar condition for 3's and 5's in Topic 6 of Mode 2). 
A conclusion justifiable in terms of these data, then, is that Mode 2 
essays were less likely than others to be judged incompetent; or, to 
put it another way, the influence of Mode 2 versions of topics 
produced a stronger tendency in writers to compose competent essays 
than did other modes. Certainly this is a tendency to consider in 
selecting a presentation mode for topics in actual administrations 
of the writing subtest of the Florida Teacher Certification 
Examination. 
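
The proportionate-distribution reasoning in this paragraph reduces to a few divisions. The Python sketch below works through them using the mode totals from the table of 3's and 5's; it is an illustration only, and the names LOW_SCORES, total, and expected_share are assumptions introduced here rather than anything used in the study.

```python
# Counts of essays scoring 3 and 5 in each mode, taken from the table.
# Variable and function names are illustrative assumptions.
LOW_SCORES = {
    "Mode 1": {"threes": 14, "fives": 31},
    "Mode 2": {"threes": 9, "fives": 32},
    "Mode 3": {"threes": 14, "fives": 37},
}
SAMPLE_SIZE = 360  # 18 cells of 20 essays each


def total(key: str) -> int:
    """Sum one score category across the three modes."""
    return sum(counts[key] for counts in LOW_SCORES.values())


def expected_share(count: int, groups: int) -> float:
    """Count each group would receive under a proportionate distribution."""
    return count / groups


low_total = total("threes") + total("fives")

print(f"3's and 5's: {low_total} of {SAMPLE_SIZE} = {low_total / SAMPLE_SIZE:.0%}")
for label, groups in (("mode", 3), ("topic", 6), ("cell", 18)):
    print(f"expected per {label}: {expected_share(low_total, groups):.1f}")

# Compare the observed number of 3's in each mode with an even split.
expected_threes = expected_share(total("threes"), 3)
for mode, counts in LOW_SCORES.items():
    print(f"{mode}: {counts['threes']} 3's observed, {expected_threes:.1f} expected")
```

An even split would place about twelve 3's in each mode, so the nine observed in Mode 2 fall below that figure while Modes 1 and 3 fall above it. The sketch makes no claim about statistical significance.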

A Rhetorical Perspective on the Essays 

A sample (about 20%) of the essays was drawn randomly from the 
eighteen cells and read by the investigator and the administrative 
assistant with an eye toward discovering any rhetorical characteristics 
or patterns that might be attributable to factors in the study. 
The result was a series of impressions based not on specific criteria--
as were the ratings--but rather on a sense of how the essays matched 
the expectations of the reviewers (who are, after all, experienced 
teachers of writing) for college educated young adults writing under 
these particular test conditions. Interestingly, the two readers 
in their independent reviews agreed substantially on what they felt 
were the dominant characteristics of essays and of certain groups of 
essays. These impressions, which follow, are meant to provide a 
supplementary gloss on the statistical analysis reported above. 

As a group, the essays written on the field trial were a 
desultory lot, distinguished chiefly, though not uniformly, by 
blandness of expression, a tendency toward overgenerality, and an 
uncertain command of the rhetorical, structural, and mechanical 
conventions of written English. Many essays struck the readers as 
the linguistic equivalents of photographs taken by an unsteady hand, 
the contours hazy and uncertain and the entire subject somewhat out 
of focus. The writers either failed to find what they wanted to say 
soon enough--many rambled as if doing unstructured thinking exercises 
on paper--or they were uncommitted to flushing out their real feelings 
on a particular topic. In a number of instances, writers simply 
had--or chose to have--little or nothing of significance to say. 
While it is hazardous to guess what impact a lack of motivation might 
have had on these writers, their papers did not reflect anything like 
genuine involvement in the business of writing on these topics. 
Whether this circumstance mirrors the lack of real concern and effort 
which often attends simulated versions of experience, and whether 
a different set of characteristics will be manifested in the actual 
subtest are, at this time, strictly moot questions. For this sample 
of essays, mediocre is an apt though possibly overgenerous 
description. 

There were notable exceptions, however, the most compelling 
being those essays written in Mode 2 of Topic 6. Despite some 
variations, these essays were as a group better organized, more 
sharply focussed, and more interesting, lively, forthright, and 
personal than any other category of papers in the sample. As a rule, 
they addressed the topic more quickly than others, had more specific 
points to make about it, and were stylistically superior to their 
counterparts in the rest of the sample. 



By design, Modes 1, 2, and 3 vary according to the amount of 
information each supplies about a topic; or, to put it another way, 
according to the degree to which each approaches a full rhetorical 
context. Mode 1 provides little information and no context whatsoever; 
Mode 2 supplies some information and an orientation to the topic; 
Mode 3 provides a good deal of information and contains all the elements 
of a full rhetorical context--audience, purpose, form, and subject. 
All affect writers in particular ways. 

Mode 1, by virtue of its low degree of specification, challenges 
a writer first to define the topic at hand and then to say something 
about it. Such a task throws a writer on his or her own resources 
early, forcing quick decisions--or at least accelerated thinking--
about what a topic means and what a writer feels about it. No 
method of organization or procedure is suggested or implied, and if 
a writer cannot bring some organizational principle directly to bear 
on an essay, it will likely founder aimlessly into waters as muddy as 
those treaded by the excerpts of essays quoted earlier. When a 
writer meets the challenge successfully, however, the result is quite 
like a good essay in Mode 2, with the exception that it takes a bit 
longer to get in focus. 

Mode 2, unlike Mode 1, gives a writer a definite place to begin. 
It makes a statement about a topic and then asks for a personal 
expression of a writer's own views on it. Many writers in the sample 
used the statement in one way or another as a means of introducing 
their own positions. In fact, this seems to be the major 
difference--perhaps the one critical difference--between Mode 2 and 
the other presentation modes: it offers a ready method of organizing 
an essay by providing a kind of pre-established path along which 
writers may channel their thoughts on a topic. In short, it supplies 
enough structure for writers to begin writing quickly and purposefully. 

Mode 3, the mode establishing a full rhetorical context, 
apparently wasn't very helpful to examinees in organizing and 
focussing their writing. Quite a few got more caught up in the 
format required, especially when a personal letter was called for, 
than in the development of their ideas on the topic. Many failed to 
do more than rehash the information given in the scenario; perhaps 
the information given acted as a boundary rather than as the stimulus 
it was intended to be. Perhaps too the hypothetical situations were, 
because of their locus in fixed, real-world events, inadequate 
introductions to the task of writing personal statements on issues 
conceived of originally as large-scale. That is, asking for a 
statement based on a particular event or situation may have elicited 
shallower responses than asking for a position on a general issue. 




Recommendations 

Mode. The evidence gathered in the statistical and the rhetorical 
analyses points clearly at Mode 2 as the presentation mode likeliest 
to stimulate the best writing of large numbers of examinees. The 
Mode 2 format is thus the preferred format for the writing subtest 
of the Florida Teacher Certification Examination. 

Topic. Of the six topics generated and validated in this study, 
two of them produced essays of higher quality in the preferred mode 
than the others. It is recommended that these topics definitely be 
among those used in the first administrations of the examination. 

Rhetorical Modifications. Because of the apparent effect of Mode 2, 
Topic 6 on the quality of essays, it is recommended that all the 
topics used in the first administrations of the examination be worded 
as closely as possible like that of Mode 2, Topic 6. Such modifications 
will bring into line the particular charge of each topic and will 
offer a fairer writer-to-writer test of compositional skill. 








